Abstract


In this note, we comment on the relevance of elicitability for backtesting risk measure
estimates. In particular, we propose the use of Diebold-Mariano tests, and show
how they can be implemented for Expected Shortfall (ES), based on the recent result
of Fissler and Ziegel (2015) that ES is jointly elicitable with Value at Risk.


There continues to be lively debate about the appropriate choice of a quantitative risk
measure for regulatory purposes or internal risk management. In this context, it has been
shown by Weber (2001) and Gneiting (2011) that Expected Shortfall (ES) is not elicitable.
Specifically, there is no strictly consistent scoring (or loss) function
$S : \mathbb{R}^2 \to \mathbb{R}$ such that, for any random variable $X$ with finite mean, we have

$$\mathrm{ES}_\alpha(X) = \arg\min_{e \in \mathbb{R}} \mathbb{E}[S(e, X)].$$

Recall that ES of $X$ at level $\alpha \in (0,1)$ is defined as

$$\mathrm{ES}_\alpha(X) = \frac{1}{\alpha} \int_0^\alpha \mathrm{VaR}_\beta(X)\, d\beta,$$

where Value at Risk (VaR) is given by
$\mathrm{VaR}_\alpha(X) = \inf\{x \in \mathbb{R} : P(X \le x) \ge \alpha\}$. In contrast,
VaR at level $\alpha \in (0,1)$ is elicitable for random variables with a unique $\alpha$-quantile. The
possible strictly consistent scoring functions for VaR are of the form


$$S_V(v, x) = (\mathbf{1}\{x \le v\} - \alpha)(G(v) - G(x)), \qquad (1)$$

where $G$ is a strictly increasing function.
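
As a minimal sketch of (1), assuming the identity choice $G(v) = v$ and a simulated standard normal sample (both are illustrative assumptions, as are the helper names `var_score` and `mean_var_score`), the following Python snippet checks numerically that the average score is smallest near the true $\alpha$-quantile. It uses only the standard library.

```python
import random
from statistics import NormalDist


def var_score(v, x, alpha, G=lambda z: z):
    """Scoring function (1): (1{x <= v} - alpha) * (G(v) - G(x)), G strictly increasing."""
    indicator = 1.0 if x <= v else 0.0
    return (indicator - alpha) * (G(v) - G(x))


def mean_var_score(v, sample, alpha):
    """Average score of the candidate VaR forecast v over a sample of realizations."""
    return sum(var_score(v, x, alpha) for x in sample) / len(sample)


alpha = 0.05
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

# Candidate forecasts on a grid from -2.50 to -0.50 in steps of 0.02.
grid = [-2.5 + 0.02 * i for i in range(101)]
v_hat = min(grid, key=lambda v: mean_var_score(v, sample, alpha))
v_true = NormalDist().inv_cdf(alpha)  # theoretical alpha-quantile of N(0, 1)

print(f"grid minimizer: {v_hat:.2f}, true quantile: {v_true:.3f}")
```

The grid search is deliberately crude; it is only meant to show that the minimizer of the average score tracks the quantile, which is what strict consistency of (1) asserts in expectation.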

However, it turns out that ES is elicitable of higher order in the sense that the pair
$(\mathrm{VaR}_\alpha, \mathrm{ES}_\alpha)$ is jointly elicitable. Indeed, we have that

$$(\mathrm{VaR}_\alpha(X), \mathrm{ES}_\alpha(X)) = \arg\min_{(v,e) \in \mathbb{R}^2} \mathbb{E}[S_{V,E}(v, e, X)],$$

where possible choices of $S_{V,E}$ are given by

$$\begin{aligned}
S_{V,E}(v, e, x) ={}& (\mathbf{1}\{x \le v\} - \alpha)(G_1(v) - G_1(x)) \\
&+ \frac{1}{\alpha} G_2(e)\,\mathbf{1}\{x \le v\}(v - x) + G_2(e)(e - v) - \mathcal{G}_2(e),
\end{aligned} \qquad (2)$$

with $G_1$ and $G_2$ being strictly increasing continuously differentiable functions such that
the expectation $\mathbb{E}[G_1(X)]$ exists, $\lim_{x \to -\infty} \mathcal{G}_2(x) = 0$ and
$\mathcal{G}_2' = G_2$; see Fissler and Ziegel (2015, Corollary 5.5). One can nicely see the
structure of $S_{V,E}$: the first summand in (2) is exactly a strictly consistent scoring
function for $\mathrm{VaR}_\alpha$ given at (1) and hence only depends on $v$, whereas the second
summand cannot be split into a part depending only on $v$ and one depending only on $e$,
respectively, hence illustrating the fact that $\mathrm{ES}_\alpha$ itself is not elicitable.
A possible choice for $G_1$ and $G_2$ is $G_1(v) = v$ and $G_2(e) = \exp(e)$.
Acerbi and Szekely (2014) proposed a scoring function for the pair
$(\mathrm{VaR}_\alpha, \mathrm{ES}_\alpha)$ under the additional assumption that there exists a real
number $w$ such that $\mathrm{ES}_\alpha(X) \ge w\,\mathrm{VaR}_\alpha(X)$
for all assets $X$ under consideration. Despite encouraging simulation results, there is
currently no formal proof available of the strict consistency of their proposal. In contrast,
the scoring functions given at (2) do not require additional assumptions, and it has been
formally proven that they provide a class of strictly consistent scoring functions.
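
To make (2) concrete, here is a minimal sketch assuming the choice $G_1(v) = v$ and $G_2(e) = \exp(e)$ mentioned above, so that the antiderivative vanishing at $-\infty$ is again $\exp(e)$. The function names, the level $\alpha = 0.025$, the sample size, and the perturbation sizes are illustrative assumptions. The snippet compares the average score at the true pair for a standard normal sample with the average score at a few perturbed pairs.

```python
import math
import random
from statistics import NormalDist


def joint_score(v, e, x, alpha):
    """Scoring function (2) with G1(v) = v and G2(e) = exp(e)."""
    hit = 1.0 if x <= v else 0.0
    g2 = math.exp(e)  # G2(e); its antiderivative with limit 0 at -infinity is also exp(e)
    var_part = (hit - alpha) * (v - x)
    es_part = g2 * hit * (v - x) / alpha + g2 * (e - v) - g2
    return var_part + es_part


def mean_joint_score(v, e, sample, alpha):
    """Average joint score of the forecast pair (v, e) over a sample."""
    return sum(joint_score(v, e, x, alpha) for x in sample) / len(sample)


alpha = 0.025
std_normal = NormalDist()
v_true = std_normal.inv_cdf(alpha)            # VaR of N(0, 1)
e_true = -std_normal.pdf(v_true) / alpha      # ES of N(0, 1)

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

best = mean_joint_score(v_true, e_true, sample, alpha)
excess = []
for dv, de in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    perturbed = mean_joint_score(v_true + dv, e_true + de, sample, alpha)
    excess.append(perturbed - best)
    print(f"dv={dv:+.1f}, de={de:+.1f}: excess mean score = {perturbed - best:.4f}")
```

Each perturbed pair should show a positive excess mean score, in line with the strict consistency of (2); the check is empirical and does not replace the formal result cited above.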

The lack of elicitability of ES (of first order) has led to a lively discussion about whether
or not and how it is possible to backtest ES forecasts; see, for example, Acerbi and Szekely
(2014), Carver (2014), and Emmer et al. (2015). It is generally accepted that elicitability
is useful for model selection, estimation, generalized regression, forecast comparison, and
forecast ranking. Having provided strictly consistent scoring functions for
$(\mathrm{VaR}_\alpha, \mathrm{ES}_\alpha)$,
we take the opportunity to comment on the role of elicitability in backtesting.

The traditional approach to backtesting aims at model verification. To this end, one
tests the null hypothesis:

$H_0$: “The risk measure estimates at hand are correct.”


Specifically, suppose we have sequences $(x_t)_{t=1,\dots,N}$ and $(v_t, e_t)_{t=1,\dots,N}$, where $x_t$ is the
realized value of the asset at time point $t$, and $v_t$ and $e_t$ denote the estimated
$\mathrm{VaR}_\alpha$ and $\mathrm{ES}_\alpha$ given at time $t - 1$ for time point $t$, respectively.
A backtest uses some test statistic $T_1$, which is a function of $(v_t, e_t, x_t)_{t=1,\dots,N}$,
such that we know the distribution of $T_1$ (at
least approximately) if the null hypothesis of correct risk measure estimates holds. If we
reject at some small level, the model or the estimation procedure for the risk measure
is deemed inadequate. For this approach of model verification, elicitability of the risk
measure is not relevant, as pointed out by Acerbi and Szekely (2014) and Davis (2014).
However, tests of this type can be problematic in regulatory practice, notably in view of
the anticipated revised standardised approach (Bank for International Settlements, 2013,
pp. 5-6), which “should provide a credible fall-back in the event that a bank’s internal
market risk model is deemed inadequate”. If the internal model fails the backtest, the
standardised approach may fail the test, too, and in fact it might be inferior to the internal
model. Generally, tests of the hypothesis $H_0$ are not aimed at, and do not allow for, model
comparison and model ranking.
Figure 1: Decisions taken in comparative backtests under the null hypotheses $H_0^-$ and $H_0^+$
at level 0.05. In the yellow region the approaches entail distinct decisions.


Alternatively, one could use the following null hypothesis in backtesting:

$H_0^-$: “The risk measure estimates at hand are at least as good
as the ones from the standard procedure.”

Here, the standard procedure could be a method specified by the regulator, or it could
be a technique that has proven to yield good results in the past. Specifically, let us write
$(v_t^*, e_t^*)_{t=1,\dots,N}$ for the sequence of $\mathrm{VaR}_\alpha$ and $\mathrm{ES}_\alpha$ estimates by the standard procedure.

Making use of the elicitability of $(\mathrm{VaR}_\alpha, \mathrm{ES}_\alpha)$, we take one of the scoring functions $S_{V,E}$
given at (2) to define the test statistic

$$T_2 = \frac{\bar{S}_{V,E} - \bar{S}^*_{V,E}}{\sigma_N}, \qquad (3)$$

where

$$\bar{S}_{V,E} = \frac{1}{N} \sum_{t=1}^{N} S_{V,E}(v_t, e_t, x_t), \qquad
\bar{S}^*_{V,E} = \frac{1}{N} \sum_{t=1}^{N} S_{V,E}(v_t^*, e_t^*, x_t),$$

and $\sigma_N$ is a suitable estimate of the respective standard deviation. Under $H_0^-$, the
test statistic $T_2$ has expected value less than or equal to zero. Following the lead of
Diebold and Mariano (1995), comparative tests that are based on the asymptotic normality
of the test statistics $T_2$ have been employed in a wealth of applications.
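
The statistic in (3) can be sketched in a few lines of Python. The sketch takes two sequences of realized scores as input and, as an assumption of this example, estimates $\sigma_N$ by the sample standard deviation of the score differences divided by $\sqrt{N}$, which is reasonable only when the differences are serially uncorrelated. The function name `dm_statistic` and the toy numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def dm_statistic(scores_internal, scores_standard):
    """Test statistic (3): mean score difference over an estimate of its standard deviation.

    Assumes serially uncorrelated score differences; with dependence, a
    heteroskedasticity and autocorrelation consistent estimator would be needed.
    """
    diffs = [s - s_star for s, s_star in zip(scores_internal, scores_standard)]
    n = len(diffs)
    sigma_n = stdev(diffs) / math.sqrt(n)  # estimated standard deviation of the mean difference
    return mean(diffs) / sigma_n


# Toy scores for four time points (illustrative numbers only).
internal = [0.10, 0.30, 0.20, 0.25]
standard = [0.20, 0.35, 0.40, 0.30]
t2 = dm_statistic(internal, standard)
print(round(t2, 3))  # negative: the internal forecasts have the lower average score
```

A negative value favors the internal procedure, since lower scores are better for a consistent scoring function.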

Under both $H_0$ and $H_0^-$, the backtest is passed if the null hypothesis fails to be rejected.
However, as Fisher (1949, p. 16) noted, “the null hypothesis is never proved or established,
but it is possibly disproved, in the course of experimentation.” In other words, a passed
backtest does not imply the validity of the respective null hypothesis. Passing the backtest
simply means that the hypothesis of correctness ($H_0$) or superiority ($H_0^-$), respectively,
could not be falsified.

In the case of comparative backtests, a more conservative approach could be based on
the following null hypothesis:

$H_0^+$: “The risk measure estimates at hand are at most as good
as the ones from the standard procedure.”


This can also be tested using the statistic $T_2$ in (3), which has expected value greater than
or equal to zero under $H_0^+$. The backtest now is passed when $H_0^+$ is rejected. The decisions
taken in comparative backtesting under $H_0^-$ and $H_0^+$ are illustrated in Figure 1, where the
colors relate to the three-zone approach of the Bank for International Settlements (2013,
pp. 103-108). In regulatory practice, the distinction between Diebold-Mariano tests under
the two hypotheses amounts to a reversed onus of proof. In the traditional setting,
it is the regulator’s burden to show that the internal model is incorrect. In contrast, if a
backtest is passed when $H_0^+$ is rejected, banks are obliged to demonstrate the superiority
of the internal model. Such an approach to backtesting may entice banks to improve their
internal models, and is akin to regulatory practice in the health sector, where market
authorisation for medicinal products hinges on comparative clinical trials. In the health
context, decision-making under $H_0^-$ corresponds to equivalence or non-inferiority trials,
which are “not conservative in nature, so that many flaws in the design or conduct of the
trial will tend to bias the results”, whereas “efficacy is most convincingly established by
demonstrating superiority” under $H_0^+$ (European Medicines Agency, 1998, p. 17). Technical
detail is available in a specialized strand of the biomedical literature; for a concise
review, see Lesaffre (2008).
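
The decision rule described above can be sketched as a small function. The zone boundaries below are inferred from the text rather than read off Figure 1: they assume one-sided tests at level 0.05 based on the standard normal approximation, and the function name `comparative_zone` is hypothetical.

```python
from statistics import NormalDist


def comparative_zone(t2, level=0.05):
    """Map the comparative test statistic to a zone (assumed one-sided normal tests)."""
    z = NormalDist().inv_cdf(1.0 - level)  # about 1.645 at level 0.05
    if t2 < -z:
        return "green"   # the conservative null is rejected: passes under both approaches
    if t2 > z:
        return "red"     # the lenient null is rejected: fails under both approaches
    return "yellow"      # passes under the lenient null only, so the decisions differ


for value in (-2.83, 0.0, 2.0):
    print(value, comparative_zone(value))
```

In the yellow zone neither null hypothesis is rejected, which is exactly where the onus of proof decides the outcome.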

We now give an illustration in the simulation setting of Gneiting et al. (2007). Specifically,
let $(\mu_t)_{t=1,\dots,N}$ be a sequence of independent standard normal random variables.
Conditional on $\mu_t$, the return $X_t$ is normally distributed with mean $\mu_t$ and variance 1,
denoted $\mathcal{N}(\mu_t, 1)$. Under our Scenario A, the standard method for estimating risk measures
uses the unconditional distribution $\mathcal{N}(0, 2)$ of $X_t$, whereas the internal procedure
takes advantage of the information contained in $\mu_t$ and uses the conditional distribution
$\mathcal{N}(\mu_t, 1)$. Therefore,

$$(v_t, e_t) = \bigl(\mathrm{VaR}_\alpha(\mathcal{N}(\mu_t, 1)),\, \mathrm{ES}_\alpha(\mathcal{N}(\mu_t, 1))\bigr)
= \Bigl(\mu_t + \Phi^{-1}(\alpha),\; \mu_t - \frac{1}{\alpha}\varphi(\Phi^{-1}(\alpha))\Bigr)$$

and

$$(v_t^*, e_t^*) = \bigl(\mathrm{VaR}_\alpha(\mathcal{N}(0, 2)),\, \mathrm{ES}_\alpha(\mathcal{N}(0, 2))\bigr)
= \Bigl(\sqrt{2}\,\Phi^{-1}(\alpha),\; -\frac{\sqrt{2}}{\alpha}\varphi(\Phi^{-1}(\alpha))\Bigr),$$

where $\varphi$ and $\Phi$ denote the density and the cumulative distribution function of the standard
normal distribution, respectively. Under Scenario B, the roles of the standard method and
the internal procedure are interchanged.
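
The closed forms for the normal case follow from a short calculation, sketched here for a general $\mathcal{N}(\mu, \sigma^2)$ distribution; the general $\sigma$ is notation introduced for this sketch only. Since $\mathrm{VaR}_\beta(\mathcal{N}(\mu, \sigma^2)) = \mu + \sigma\,\Phi^{-1}(\beta)$, the substitution $\beta = \Phi(z)$ together with $\varphi'(z) = -z\,\varphi(z)$ gives

$$\mathrm{ES}_\alpha(\mathcal{N}(\mu, \sigma^2))
= \frac{1}{\alpha} \int_0^\alpha \bigl(\mu + \sigma\,\Phi^{-1}(\beta)\bigr)\, d\beta
= \mu + \frac{\sigma}{\alpha} \int_{-\infty}^{\Phi^{-1}(\alpha)} z\,\varphi(z)\, dz
= \mu - \frac{\sigma}{\alpha}\,\varphi(\Phi^{-1}(\alpha)).$$

Setting $\sigma = 1$ and $\sigma = \sqrt{2}$ recovers the two pairs of forecasts. As a numerical illustration at $\alpha = 0.025$, one has $\Phi^{-1}(0.025) \approx -1.960$ and $\varphi(-1.960) \approx 0.0584$, so that $\mathcal{N}(0, 1)$ yields approximately $(-1.960, -2.338)$ and $\mathcal{N}(0, 2)$ yields approximately $(-2.772, -3.306)$.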

We use sample size $N = 250$ and repeat the experiment 10,000 times. As tests of traditional
type, we consider the coverage test for $\mathrm{VaR}_{0.01}$ described by the Bank for International
Settlements (2013, pp. 103-108) and the generalized coverage test for $\mathrm{ES}_{0.025}$ proposed by
Costanzino and Curran (2015). As shown by Clift et al. (2015), the latter performs similarly
to the approaches of Wong (2008) and Acerbi and Szekely (2014), but is easier to implement.
The outcome of the test is structured into green, yellow, and red zones, as described in the
aforementioned references. For the comparative backtest for $(\mathrm{VaR}_{0.025}, \mathrm{ES}_{0.025})$, we use the functions
$G_1(v) = v$ and $G_2(e) = \exp(e)/(1 + \exp(e))$ in (2) and define the zones as implied by
Figure 1. Finally, our comparative backtest for $\mathrm{VaR}_{0.01}$ uses the function $G(v) = v$ in (1),
which is equivalent to putting $G_1(v) = v$ and $G_2(e) = 0$ in (2). For $\sigma_N$ in the test statistic
$T_2$ in (3) we use the standard estimator.
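
The comparative backtest for the pair under Scenario A can be sketched as follows. This is a rough re-creation under stated assumptions, not the code behind the reported results: it uses only 400 repetitions to stay quick, it interprets the "standard estimator" as the sample standard deviation of the score differences divided by $\sqrt{N}$, it takes $\log(1 + \exp(e))$ as the antiderivative of $G_2$, and it assigns zones by one-sided normal critical values at level 0.05. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

ALPHA = 0.025
STD_NORMAL = NormalDist()
V0 = STD_NORMAL.inv_cdf(ALPHA)           # VaR of N(0, 1)
E0 = -STD_NORMAL.pdf(V0) / ALPHA         # ES of N(0, 1)


def score(v, e, x, alpha=ALPHA):
    """Scoring function (2) with G1(v) = v and G2(e) = exp(e) / (1 + exp(e))."""
    hit = 1.0 if x <= v else 0.0
    g2 = 1.0 / (1.0 + math.exp(-e))      # logistic function
    g2_anti = math.log1p(math.exp(e))    # antiderivative of G2, tends to 0 as e -> -infinity
    return ((hit - alpha) * (v - x)
            + g2 * hit * (v - x) / alpha + g2 * (e - v) - g2_anti)


def t2_scenario_a(n, rng):
    """Test statistic (3) for one simulated path of length n under Scenario A."""
    diffs = []
    for _ in range(n):
        mu = rng.gauss(0.0, 1.0)
        x = rng.gauss(mu, 1.0)
        internal = score(mu + V0, mu + E0, x)                           # based on N(mu, 1)
        standard = score(math.sqrt(2.0) * V0, math.sqrt(2.0) * E0, x)   # based on N(0, 2)
        diffs.append(internal - standard)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


def zone(t2, level=0.05):
    """Zone assignment assumed from the text: one-sided normal tests at the given level."""
    z = STD_NORMAL.inv_cdf(1.0 - level)
    return "green" if t2 < -z else "red" if t2 > z else "yellow"


rng = random.Random(2015)
reps = 400  # far fewer than the 10,000 repetitions in the text
counts = {"green": 0, "yellow": 0, "red": 0}
for _ in range(reps):
    counts[zone(t2_scenario_a(250, rng))] += 1
shares = {name: count / reps for name, count in counts.items()}
print(shares)
```

Swapping the roles of `internal` and `standard` flips the sign of the statistic and hence gives the corresponding sketch for Scenario B.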

Table 1 summarizes the simulation results under Scenario A and B, respectively. The
traditional backtests are performed for the internal model in the scenario at hand.

Table 1: Percentage of decisions in the green, yellow, and red zone in traditional and
comparative backtests. Under Scenario A, a decision in the green zone is desirable and in
the joint interest of banks and regulators. Under Scenario B, the red zone corresponds to
a decision in the joint interest of all stakeholders.

| Scenario | Backtest | Risk measure | Green | Yellow | Red |
|---|---|---|---|---|---|
| A | Traditional | $\mathrm{VaR}_{0.01}$ | 89.35 | 10.65 | 0.00 |
| A | Traditional | $\mathrm{ES}_{0.025}$ | 93.62 | 6.36 | 0.02 |
| A | Comparative | $\mathrm{VaR}_{0.01}$ | 88.23 | 11.77 | 0.00 |
| A | Comparative | $(\mathrm{VaR}_{0.025}, \mathrm{ES}_{0.025})$ | 87.22 | 12.78 | 0.00 |
| B | Traditional | $\mathrm{VaR}_{0.01}$ | 89.33 | 10.67 | 0.00 |
| B | Traditional | $\mathrm{ES}_{0.025}$ | 93.80 | 6.18 | 0.02 |
| B | Comparative | $\mathrm{VaR}_{0.01}$ | 0.00 | 11.77 | 88.23 |
| B | Comparative | $(\mathrm{VaR}_{0.025}, \mathrm{ES}_{0.025})$ | 0.00 | 12.78 | 87.22 |

Under Scenario A, the four tests give broadly equivalent results. The benefits of the comparative
approach become apparent under Scenario B, where the traditional approach yields highly
undesirable decisions in accepting a simplistic internal model, while a more informative
standard model would be available. This can neither be in banks’ nor in regulators’
interests. We emphasize that this problem will arise with any traditional backtest, as
a traditional backtest assesses optimality only with respect to the information used for
providing the risk measure estimates.
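
The last point can be illustrated with a small sketch, assuming the simulation setting described earlier and the level $\alpha = 0.01$; the variable names and the sample size are illustrative. Both the informed forecast based on $\mathcal{N}(\mu_t, 1)$ and the simplistic forecast based on $\mathcal{N}(0, 2)$ are exceeded with frequency close to $\alpha$, so a test that only counts exceedances has no basis for preferring one over the other.

```python
import math
import random
from statistics import NormalDist

alpha = 0.01
q = NormalDist().inv_cdf(alpha)  # alpha-quantile of N(0, 1)

rng = random.Random(7)
n = 200_000
hits_informed = 0
hits_simplistic = 0
for _ in range(n):
    mu = rng.gauss(0.0, 1.0)
    x = rng.gauss(mu, 1.0)
    hits_informed += x <= mu + q                  # VaR forecast from N(mu, 1)
    hits_simplistic += x <= math.sqrt(2.0) * q    # VaR forecast from N(0, 2)

rate_informed = hits_informed / n
rate_simplistic = hits_simplistic / n
print(f"informed: {rate_informed:.4f}, simplistic: {rate_simplistic:.4f}")
```

Both forecasts are correct relative to their own information, which is why only a comparison of scores can separate them.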

Comparative tests based on test statistics of the form $T_2$ in (3) can be used to compare
forecasts in the form of full predictive distributions, provided a proper scoring rule is
used (Gneiting and Raftery, 2007), or to compare risk assessments, provided the risk measure
admits a strictly consistent scoring function, so elicitability is crucial. In particular,
proper scoring rules and consistent scoring functions are sensitive to increasing information
utilized for prediction; see Holzmann and Eulert (2014). However, as consistent scoring
functions are not unique, a question of prime practical interest is which functions ought
to be used in regulatory settings or internally.

Arguably, now may be the time to revisit and investigate fundamental statistical issues 
in banking supervision. Chances are that comparative backtests, where a bank’s internal 
risk model is held accountable relative to an agreed-upon standardised approach, turn out 
to be beneficial to all stakeholders, including banks, regulators, and society at large. 